
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine #9639

Draft · wants to merge 182 commits into master

Conversation

@ochafik (Collaborator) commented Sep 25, 2024

This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).

Background

It tackles two main problems related to tool calling:

  • Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (unless "tool_choice": "required" is specified in the request). It's not currently possible to say .* "<tool_call>" constrained "</tool_call>", as the leading .* will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I avoided this issue with the thoughtful_steps style, but the native tool call styles were still problematic.

    • Solved w/ lazy grammars activated by trigger words (similar to stop words, and refactored into the same implementation). Output is completely unconstrained before triggers, and completely constrained after, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that). A sketch of the gating idea follows this list.

      • For Llama3.1-Instruct (cf. llama-stack-apps repo / these docs) for instance, triggers are <|python_tag|> and {"name": "toolN" (for each toolN in the list of tools in the request).
      • For Llama3.2-Instruct, we eagerly trigger on {" which isn't quite right but helps steer 1B & 3B models. Will try and detect model size to keep a more specific trigger for the bigger 3.2 models.
      • For Hermes Pro (cf. Hermes-Function-Calling repo), it's <tool_call>.
      • For Functionary v3.2, it's >>>toolN\n for each toolN.
      • For Functionary v3.1 (llama3.1-based), it's <function= and <|python_tag|>.
      • For Mistral Nemo, the trigger ought to be [TOOL_CALLS] but it doesn't seem to (ever?) be emitted, so we're triggering on {" instead for now.
      • For other models ("generic" tool call style), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required).
  • Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

    • Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
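
To make the trigger mechanism concrete, here's a minimal C++ sketch of the gating idea. It is illustrative only (the names and signature are made up for this sketch): the actual implementation matches many triggers at once via the Aho–Corasick-based antiprompt code described below, not a naive substring scan.

    #include <string>
    #include <vector>

    // Conceptual sketch of lazy-grammar gating: decode unconstrained until a
    // trigger word appears in the output, then enforce the tool-call grammar.
    struct lazy_grammar_state {
        std::vector<std::string> trigger_words; // e.g. "<tool_call>" or "{\"name\": \"toolN\""
        bool        triggered = false;
        std::string buffer;                     // generated text so far, searched for triggers
    };

    // Called once per generated token piece; returns true once sampling
    // should be constrained by the tool-call grammar.
    bool update_lazy_grammar(lazy_grammar_state & state, const std::string & piece) {
        if (state.triggered) {
            return true;
        }
        state.buffer += piece;
        for (const auto & word : state.trigger_words) {
            if (state.buffer.find(word) != std::string::npos) {
                // Everything before the trigger stays unconstrained "content";
                // the grammar takes over from the trigger word onwards.
                state.triggered = true;
                return true;
            }
        }
        return false;
    }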

With this intro out of the way, here are the parts of this PR that could possibly be sent separately (itemized here, to be re-itemized as commits):

  • grammar_trigger_words + llama_antiprompts: refactors the stop logic (barebones Aho–Corasick algorithm to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning the CLI & server (e.g. single-token stop logic) and handling grammar trigger words.

  • minja.hpp + test/{test-minja.cpp,update_jinja_goldens.py,chat/{contexts,templates,goldens}}: minimal Jinja templating engine and its tests against actual templates & a few test contexts (now in its own repo: https://github.com/google/minja; see the usage sketch after this list)

  • Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro (a parsing sketch follows this list)

  • Integration in llama-server (fenced by --jinja) w/ tools, tool_choice support + updated response_format compliance.

  • Minimal examples/agent with a tool call / action loop, barebones tools and instructions / support to run them in a siloed docker container (see usage below)
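
As a rough illustration of what rendering with minja looks like, here is a short sketch; it assumes the parse/render entry points shown in the minja repository's README, which may differ in detail from the minja.hpp vendored at the time of this PR:

    #include "minja.hpp"             // single-header engine from this PR
    #include <nlohmann/json.hpp>
    #include <iostream>

    int main() {
        using json = nlohmann::json;
        // A toy chat template in Jinja syntax (real ones come from GGUF metadata).
        std::string source =
            "{% for m in messages %}<|{{ m.role }}|>{{ m.content }}\n{% endfor %}";
        auto tmpl = minja::Parser::parse(source, /* options = */ {});
        auto ctx  = minja::Context::make(minja::Value(json {
            {"messages", json::array({
                json {{"role", "user"}, {"content", "Hello!"}},
            })},
        }));
        std::cout << tmpl->render(ctx);  // prints: <|user|>Hello!
        return 0;
    }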
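And here is an illustrative sketch of the kind of extraction the Hermes 2 Pro output parser performs (the PR's actual parser handles more cases, e.g. multiple calls and content mixed with calls; this only shows the shape of the problem):

    #include <nlohmann/json.hpp>
    #include <optional>
    #include <string>

    struct tool_call_result {
        std::string    name;
        nlohmann::json arguments;
    };

    // Extract the first <tool_call>{...}</tool_call> span from model output, if any.
    std::optional<tool_call_result> parse_hermes_tool_call(const std::string & output) {
        const std::string open = "<tool_call>", close = "</tool_call>";
        auto b = output.find(open);
        if (b == std::string::npos) return std::nullopt; // plain content, no tool call
        auto e = output.find(close, b + open.size());
        if (e == std::string::npos) return std::nullopt; // truncated / incomplete call
        auto body = nlohmann::json::parse(output.substr(b + open.size(), e - (b + open.size())));
        return tool_call_result{
            body.at("name").get<std::string>(),
            body.value("arguments", nlohmann::json::object()),
        };
    }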

How to use / test

While any model should work (using generic support based on JSON schema constraints), this PR supports the native call style of a few models:

  • Llama 3.x
  • Functionary 3.x
  • Hermes 2/3, Qwen 2.5
  • Mistral Nemo.

For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the tool_use variant of the Jinja template if it's present in the GGUF metadata). You can check which template is defined by inspecting http://localhost:8080/props, and look for Tool call style: in the logs.

Here's how to run an agent w/ local tool call:

  • Install prerequisite: uv (used to simplify python deps)

  • Run llama-server w/ any model:

    make -j LLAMA_CURL=1 llama-server
    
    # Native support for Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x
    # Note that some of these GGUFs lack the right template, so we override it
    # (otherwise they'd use the generic tool call support, which may be less efficient
    # and consume more tokens)
    
    ./llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr bartowski/Qwen2.5-7B-Instruct-GGUF -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf
    
    ./llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr NousResearch/Hermes-3-Llama-3.1-8B-GGUF -hff Hermes-3-Llama-3.1-8B.Q4_K_M.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )
    
    ./llama-server --jinja -fa --verbose \
      -hfr meetkai/functionary-small-v3.2-GGUF -hff functionary-small-v3.2.Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meetkai/functionary-medium-v3.2 )
    
    ./llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Llama-3.2-3B-Instruct-GGUF -hff Llama-3.2-3B-Instruct-Q6_K.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meta-llama/Llama-3.2-3B-Instruct )
    
    # Note the --special flag: this is needed b/c of a regression from the last merge, will fix!
    ./llama-server --jinja -fa -ctk q8_0 -ctv q8_0 --verbose --special \
      -hfr bartowski/Mistral-Nemo-Instruct-2407-GGUF -hff Mistral-Nemo-Instruct-2407-Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py mistralai/Mistral-Nemo-Instruct-2407 )
    
    # Generic support, e.g. Phi 3.5, Gemma 2b, but really anything goes
    
    ./llama-server --jinja -fa --verbose \
      -hfr bartowski/Phi-3.5-mini-instruct-GGUF -hff Phi-3.5-mini-instruct-Q4_K_M.gguf
    
    ./llama-server --jinja -fa --verbose \
      -hfr bartowski/gemma-2-2b-it-GGUF -hff gemma-2-2b-it-Q4_K_M.gguf
  • Run the tools in examples/agent/tools inside a docker container for some level of isolation (+ sneaky logging of outgoing http and https traffic: you wanna watch over those agents' shoulders for the time being 🧐). Check http://localhost:8088/docs to see the tools exposed.

    export BRAVE_SEARCH_API_KEY=... # Get one at https://api.search.brave.com/
    ./examples/agent/serve_tools_inside_docker.sh

    [!WARNING]
    The command above gives tools (and your agent) access to the web (and read-only access to examples/agent/**). You can loosen / restrict web access in examples/agent/squid/conf/squid.conf.

  • Run the agent with some goal

    uv run examples/agent/run.py "What is the sum of 2535 squared and 32222000403?"
    Output w/ Hermes-3-Llama-3.1-8B:
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  python(code="print(2535**2 + 32222000403)")
    → 15 chars
    The sum of 2535 squared and 32222000403 is 32228426628.
    
    uv run examples/agent/run.py "What is the best BBQ joint in Laguna Beach?"
    Output w/ Hermes-3-Llama-3.1-8B:
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  brave_search(query="best bbq joint in laguna beach")
    → 4283 chars
    Based on the search results, Beach Pit BBQ seems to be a popular and highly-rated BBQ joint in Laguna Beach. They offer a variety of BBQ options, including ribs, pulled pork, brisket, salads, wings, and more. They have dine-in, take-out, and catering options available.
    
    uv run examples/agent/run.py "Search (with brave), fetch and summarize the homepage of llama.cpp"
    Output w/ Hermes-3-Llama-3.1-8B:
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  brave_search(query="llama.cpp")
    → 3330 chars
    Llama.cpp is an open-source software library written in C++ that performs inference on various Large Language Models (LLMs). Alongside the library, it includes a CLI and web server. It is co-developed alongside the GGML project, a general-purpose tensor library. Llama.cpp is also available with Python bindings, known as llama.cpp-python. It has gained popularity for its ability to run LLMs on local machines, such as Macs with NVIDIA RTX systems. Users can leverage this library to accelerate LLMs and integrate them into various applications. There are numerous resources available, including tutorials and guides, for getting started with Llama.cpp and llama.cpp-python.
    
  • To compare the above results w/ a cloud provider's tool usage behaviour, just set the --provider flag (accepts openai, together, groq) and/or use --endpoint, --api-key, and --model (a sketch of the request shape involved follows these steps).

    export LLAMA_API_KEY=...      # for --provider=llama.cpp https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
    export OPENAI_API_KEY=...     # for --provider=openai    https://platform.openai.com/api-keys
    export TOGETHER_API_KEY=...   # for --provider=together  https://api.together.ai/settings/api-keys
    export GROQ_API_KEY=...       # for --provider=groq      https://console.groq.com/keys
    uv run examples/agent/run.py "Search for, fetch and summarize the homepage of llama.cpp" --provider=openai
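
For reference, here is the rough shape of an OpenAI-style chat completion request that exercises the new tools / tool_choice support. Field names follow the OpenAI chat-completions schema; the brave_search parameter schema below is made up for this sketch, not a transcript of the agent's actual requests:

    #include <nlohmann/json.hpp>
    #include <iostream>

    int main() {
        using json = nlohmann::json;
        // Body for POST /v1/chat/completions on the llama-server started above.
        json request = {
            {"messages", json::array({
                json {{"role", "user"}, {"content", "What is the best BBQ joint in Laguna Beach?"}},
            })},
            {"tools", json::array({
                json {
                    {"type", "function"},
                    {"function", {
                        {"name", "brave_search"},
                        {"description", "Search the web with the Brave Search API"},
                        {"parameters", {
                            {"type", "object"},
                            {"properties", {
                                {"query", {{"type", "string"}}},
                            }},
                            {"required", json::array({"query"})},
                        }},
                    }},
                },
            })},
            // "auto" lets the model emit either content or tool calls (lazy grammar);
            // "required" forces a fully grammar-constrained tool call.
            {"tool_choice", "auto"},
        };
        std::cout << request.dump(2) << std::endl;
        return 0;
    }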

TODOs before undrafting:

  • Move minja to its own location w/ fuller testing (fuzzing, etc) or at least its own PR --> https://github.com/google/minja
  • Port former behave / feature tool call tests to new pytest setup (server : replace behave with pytest #10416)
  • Fix regression requiring --special for Nemo since last merge
  • e2e tests for agent
  • Add a way to require trigger word to be at start of output
  • Fix CI build (tests still failing on windows)
  • Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
  • Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1
  • Add Google search tool as alternative to Brave
  • Functionary v3.2: strip leading "all\n" in non-tool-call outputs
  • Add grammar trigger words support to llama-cli
  • Support regexps as antiprompts? Would allow triggering tool call grammar for small Llama 3.2 models (1B, 3B) on (^|\n)?{" and otherwise not trigger spuriously elsewhere.
  • Add support for broken templates (GLM3..., Command R Plus, DeepSeek)
  • Nemo: handle special [TOOL_CALLS] token
  • Qwen2.5-72B-Instruct
  • Llama: suspicious early terminations in hello world tests when using the explicit python tool w/ json output (could be a failure to escape strings?). Also, need to keep the special <|python_tag|> token
  • Bring back generic thoughtful_steps tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
  • Add support for {"type": "code_interpreter"} (special-cased by functionary-medium-v3.1's template), maybe using ipython automatically for llama 3.1
  • Support jinja templates that explode on system prompts (replicate current chat template handling that puts system in user)
  • Add more tests (heavy e2e w/ actual models, tool_choice = none, parallel tool call, etc)
  • Add configurable network isolation of tools w/ a proxy (also caches pip & deb packages & limits access to host)
  • KV cache saving / reuse (within session & beyond) in agent (--cache-prompt defaults to true; follow up will be to allow in-slot restoration and saving of cache, see this branch for instance)
  • Add tool call grammar tests (although indirectly covered by server "required" test cases)
  • Add more tools (brave search) + agent examples
  • Refactorings?
    • Ideally would pass some kind of ChatHandler between OAI init & final callback, and make it handle streaming / non streaming cases? (should parallel tool calls be streamed?)
    • chat_template should maybe be resolved earlier? (now a llama_chat_template class)
    • llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API llama_chat_template::apply)
    • llama_token_to_piece(ctx, token) should really take (model, token) instead, but that's a breaking API change
      • calls common-local _llama_token_to_piece that takes model. Moved llama_chat_template_from_model helper to common.cpp
  • Fix functionary-medium-* templates' golden generation
  • Add examples to server readme
  • Support key-value overrides for templates (e.g. builtin_tools and todays_date in llama3.1's template)
    • Done by tool call handler, not user-configurable
  • Unify test-chat-templates & test-minja (write each test case in a .jinja file)
    • Fix a couple of missing bos_token in the current chat template logic
  • Bring back agent / tool call loop example + python tools isolation in docker (examples/tool-call) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
  • Test w/ meetkai/functionary-small-v3.2

Possible follow ups:

  • Add tool call loop to the default web chat using Pyodide as a python interpreter?

@github-actions bot added the testing, examples, python, and server labels on Sep 25, 2024
@ochafik ochafik changed the title Tool call support (Llama 3.1, Functionary 3.2, Hermes 2 Pro) & Minimalist Jinja template engine Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine Sep 25, 2024
@ochafik ochafik changed the title Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Sep 25, 2024
@ngxson (Collaborator) commented Dec 1, 2024

Hey @ochafik, this is impressive! It's a nice idea to bring a jinja parser into llama.cpp.

I'm interested in this direction. But the current PR is quite big to review. Do you think it's possible to split the jinja part into a dedicated PR?

Btw, @Vaibhavs10, @Rocketknight1 (Matt) and I can help to further improve the jinja implementation. My suggestions are:

  • We can have a first "it just works" version
  • Then, we can run that version on a set of known jinja templates on the Hugging Face hub to see what percentage can be parsed
  • Based on the result, we can decide if:
    • We should further improve the jinja engine
    • Or have the jinja + old heuristic methods co-exist

@ochafik (Collaborator, Author) commented Dec 4, 2024

Hey @ngxson, thanks for the enthusiasm!

As it turns out, I just got approval today (🎉) from my employer to launch Minja in its own repo → https://github.com/google/minja (this way I'll be able to set up more tests - including fuzzing - and distinct CI, and take some of the complexity away from this PR & llama.cpp in general: we'll just copy minja.hpp as we do for json.hpp and httplib.h)

I'll resume updates to this PR; I've been experimenting along the lines of some of @ggerganov's comments, but it needs more work.

> run that version on a set of known jinja templates on the Hugging Face hub to see what percentage can be parsed

Here's the list of tested model templates: https://github.com/google/minja/blob/main/tests/CMakeLists.txt#L22

I'd love suggestions of additional models, feel free to open a bug or PR there with either things that work or things that don't.
